02 - R framework with IMPACT - session 2

Author

Yann Say

Published

May 7, 2024

library(cleaningtools)
library(dplyr)

my_raw_dataset <- cleaningtools::cleaningtools_raw_data
my_kobo_survey <- cleaningtools::cleaningtools_survey
my_kobo_choice <- cleaningtools::cleaningtools_choices

This section continues with the cleaning step.

Recap

more_logs <- my_raw_dataset %>% 
  check_duplicate(uuid_column = "X_uuid") %>% 
  check_soft_duplicates(uuid_column = "X_uuid", kobo_survey = my_kobo_survey, sm_separator = ".") %>%
  check_outliers(uuid_column = "X_uuid") %>%
  check_value(uuid_column = "X_uuid") 
more_logs$checked_dataset <- more_logs$checked_dataset %>% 
  add_duration(uuid_column = "X_uuid", start_column = "X.U.FEFF.start", end_column = "end")
more_logs <- more_logs %>% 
  check_duration(column_to_check = "duration", uuid_column = "X_uuid")
other_columns_to_check <- my_kobo_survey %>% 
  filter(type == "text") %>% 
  filter(name %in% names(my_raw_dataset)) %>%
  pull(name) 

more_logs <- more_logs %>% 
  check_others(uuid_column = "X_uuid", columns_to_check = other_columns_to_check) 
logical_check_list <- readxl::read_excel("inputs/01 - example - check_list.xlsx")
more_logs <- more_logs %>% 
  check_logical_with_list(uuid_column = "X_uuid",
                          list_of_check = logical_check_list,
                          check_id_column = "check_id",
                          check_to_perform_column = "check_to_perform",
                          columns_to_clean_column = "columns_to_clean",
                          description_column = "description")

Cleaning - Creating a cleaning log

create_*

create_* functions will create, transform something, e.g. creating a cleaning log with the checks to be filled, create analysis results table, create an output.

Outputs from create_* functions outputs can be in different shape, format, etc.

create_* function is catch-all.

create_combined_log

create_combined_log will combined all the logs from a list into one. It will also add 2 columns that will be used for the cleaning.

names(more_logs)
[1] "checked_dataset"    "duplicate_log"      "soft_duplicate_log"
[4] "potential_outliers" "flaged_value"       "duration_log"      
[7] "other_log"          "logical_all"       
my_combined_log <- create_combined_log(more_logs)
List of element to combine- checked_dataset, duplicate_log, soft_duplicate_log, potential_outliers, flaged_value, duration_log, other_log, logical_all
typeof(my_combined_log)
[1] "list"
names(my_combined_log)
[1] "checked_dataset" "cleaning_log"   
my_combined_log$cleaning_log %>% 
  head()
uuid old_value question issue check_id check_binding change_type new_value
b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 86 age_respondent_r outlier (normal distribution) NA age_respondent_r / b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 NA NA
956b5ed0-5a62-41b7-aec3-af93fbc5b494 84 age_respondent_r outlier (normal distribution) NA age_respondent_r / 956b5ed0-5a62-41b7-aec3-af93fbc5b494 NA NA
97ad6294-30c6-454e-a0b3-42126415b767 18 age_respondent_r outlier (log distribution) NA age_respondent_r / 97ad6294-30c6-454e-a0b3-42126415b767 NA NA
e005e719-57c4-44a3-ac2f-5d6d1ff68831 18 age_respondent_r outlier (log distribution) NA age_respondent_r / e005e719-57c4-44a3-ac2f-5d6d1ff68831 NA NA
c9aaa542-118f-4e42-93de-fb0916572541 19 num_hh_member outlier (normal distribution) NA num_hh_member / c9aaa542-118f-4e42-93de-fb0916572541 NA NA
48e8896b-d1be-4600-8839-2d8b994ebcfb 19 num_hh_member outlier (normal distribution) NA num_hh_member / 48e8896b-d1be-4600-8839-2d8b994ebcfb NA NA

The cleaning log contains all the columns from all the logs from more_logs with in addition:

  • check_binding is filled for all rows.
  • change_type (empty)
  • new_value (empty)

add_info_to_cleaning_log

If more information from the dataset should be added, the function add_info_to_cleaning_log can help.

add_*

add_* functions will add a variable (column) to the dataset. For example, to add the duration of a survey, to add the food consumption score category, etc.

add_* function takes a dataset as input and returns the dataset + the new indicator (and any intermediate steps used for the calculation).

For example, to check the duration of a survey, there is only the start and end, but not the duration column.

my_combined_log <- my_combined_log %>% 
  add_info_to_cleaning_log(dataset_uuid_column = "X_uuid", 
                           information_to_add = "enumerator_num")

my_combined_log$cleaning_log %>% 
  head()
uuid old_value question issue check_id check_binding change_type new_value enumerator_num
019bc718-c06a-46b8-bba8-c84f6c6efbd5 لا اعلم water_supply_other_neighbourhoods_why recode other NA water_supply_other_neighbourhoods_why / 019bc718-c06a-46b8-bba8-c84f6c6efbd5 NA NA 12
019bc718-c06a-46b8-bba8-c84f6c6efbd5 لا ارى جدوى من ذلك prefer_not_engage_other recode other NA prefer_not_engage_other / 019bc718-c06a-46b8-bba8-c84f6c6efbd5 NA NA 12
019bc718-c06a-46b8-bba8-c84f6c6efbd5 السلطات ليست مهتمة بالخدمات trust_water_office_why_not recode other NA trust_water_office_why_not / 019bc718-c06a-46b8-bba8-c84f6c6efbd5 NA NA 12
019bc718-c06a-46b8-bba8-c84f6c6efbd5 247.20 duration Duration is lower or higher than the thresholds NA duration / 019bc718-c06a-46b8-bba8-c84f6c6efbd5 NA NA 12
03183d24-0275-43fe-8976-d076f29de590 عدم توفير خدمه المياه في المنطقه وضعف في تزويد خدمه وعدد ساعات اقل تجهيز للبيوت water_supply_other_neighbourhoods_why recode other NA water_supply_other_neighbourhoods_why / 03183d24-0275-43fe-8976-d076f29de590 NA NA 2
03183d24-0275-43fe-8976-d076f29de590 10 pay_water_charges_amount outlier (log distribution) NA pay_water_charges_amount / 03183d24-0275-43fe-8976-d076f29de590 NA NA 2

create_xlsx_cleaning_log

create_xlsx_cleaning_log(my_combined_log,
                         sm_dropdown_type = "logical",
                         output_path =  "outputs/01 - example - cleaning-log-no-kobo.xlsx")

create_xlsx_cleaning_log will write an excel file with:

  • checked_dataset tab: the checked dataset, with additional columns if any.
  • cleaning_log tab: the combined log with the change_type column with a data validation rules.
  • readme tab: change_type values definition.

There are 4 actions possible:

  • change_response: Change the response to new.value
  • blank_response: Remove and NA the response
  • remove_survey: Delete the survey
  • no_action: No action to take.

This log will have to be filled in with actions to take and new value if needed.

create_xlsx_cleaning_log(my_combined_log,
                         kobo_survey = my_kobo_survey,
                         kobo_choices = my_kobo_choice,
                         use_dropdown = T,
                         sm_dropdown_type = "logical",
                         output_path =  "outputs/02 - example - cleaning-log-with-kobo.xlsx")

If the KOBO information are provided and the use_dropdown argument is set to TRUE, new_value will have a data validation rule based on the KOBO options.

Note

Select multiple dummy columns (TRUE/FALSE or 1/0) are flagged and used later for the cleaning, not the parent column.

Cleaning - Creating a clean dataset

Important

The cleaning has to be filled before moving forward. The above steps are to create the cleaning log, not to fill it or clean the dataset. Filling the cleaning log is not an automatic step for this process. The decision to change a value has to be recorded.

review_*

review_* functions will review an object by comparing it to standards or another object and flags differences, e.g. reviewing the cleaning by comparing the raw dataset, the clean dataset and the cleaning log, analysis comparing it with another analysis.

  • if the cleaning has been filled correctly
  • the cleaning has been done correctly
  • comparing indicators
  • comparing analysis
  • etc.

review_cleaning_log

review_cleaning_log will review the filled cleaning log.

my_filled_log <- readxl::read_excel("inputs/02 - example - cleaning-log-with-kobo - filled.xlsx", sheet = 2)

check_log_results <- review_cleaning_log(raw_dataset = my_raw_dataset,
                                        raw_data_uuid_column = "X_uuid",
                                        cleaning_log = my_filled_log, 
                                        cleaning_log_uuid_column = "uuid",
                                        cleaning_log_question_column = "question",
                                        cleaning_log_new_value_column = "new_value",
                                        cleaning_log_change_type_column = "change_type",
                                        change_response_value = "change_response")
check_log_results
[1] "no issues in cleaning log found"

create_clean_data

To create the clean dataset, create_clean_data will use the raw dataset and the filled cleaning log.

my_clean_data <- create_clean_data(raw_dataset = my_raw_dataset,
                                   raw_data_uuid_column = "X_uuid",
                                   cleaning_log = my_filled_log, 
                                   cleaning_log_uuid_column = "uuid",
                                   cleaning_log_question_column = "question",
                                   cleaning_log_new_value_column = "new_value",
                                   cleaning_log_change_type_column = "change_type")
[1] "water_supply_other_neighbourhoods_why"
[1] "trust_water_office_why_not"
[1] "pay_water_charges_amount"
[1] "connection_fees_amount"
[1] "connection_fees_amount"
[1] "primary_livelihood.employment"
[1] "primary_livelihood.employment"
[1] "primary_livelihood.employment"
[1] "tank_emptied"
[1] "access_water_enough"

recreate_parent_column

In the cleaning log, some select multiple are changed, but only the dummy.

my_filled_log %>% 
  filter(question == "primary_livelihood.employment", 
         change_type == "change_response") %>% 
  select(uuid, question, old_value, new_value)
uuid question old_value new_value
eb3b9935-fa0b-4d54-8058-3b629f1421ad primary_livelihood.employment TRUE FALSE
eec1f630-15d5-475e-a344-32bba74b32ea primary_livelihood.employment TRUE FALSE
f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb primary_livelihood.employment TRUE FALSE

The parent column did not change, in the following example employment still appears in the parent column.

my_clean_data %>% 
  filter(X_uuid %in% c("eb3b9935-fa0b-4d54-8058-3b629f1421ad", "eec1f630-15d5-475e-a344-32bba74b32ea", "f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb")) %>% 
  select(X_uuid,primary_livelihood, primary_livelihood.employment)
X_uuid primary_livelihood primary_livelihood.employment
f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb employment FALSE
eb3b9935-fa0b-4d54-8058-3b629f1421ad employment ngo FALSE
eec1f630-15d5-475e-a344-32bba74b32ea employment FALSE

recreate_parent_column will recode the parent columns based on the dummies.

my_clean_data2 <- recreate_parent_column(dataset = my_clean_data,
                                         uuid_column = "X_uuid",
                                         kobo_survey = my_kobo_survey,
                                         kobo_choices = my_kobo_choice,
                                         sm_separator = ".", 
                                         cleaning_log_to_append = my_filled_log)

The parent are corrected, employment does not appear in the parent column.

my_clean_data2$data_with_fix_concat %>% 
  filter(X_uuid %in% c("eb3b9935-fa0b-4d54-8058-3b629f1421ad", "eec1f630-15d5-475e-a344-32bba74b32ea", "f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb")) %>% 
  select(X_uuid,primary_livelihood, primary_livelihood.employment)
X_uuid primary_livelihood primary_livelihood.employment
f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb NA FALSE
eb3b9935-fa0b-4d54-8058-3b629f1421ad ngo FALSE
eec1f630-15d5-475e-a344-32bba74b32ea NA FALSE

The changes were added to the cleaning log.

my_clean_data2$cleaning_log %>% 
  filter(question == "primary_livelihood", 
         uuid %in% c("eb3b9935-fa0b-4d54-8058-3b629f1421ad", "eec1f630-15d5-475e-a344-32bba74b32ea", "f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb")) %>% 
  select(uuid, question, old_value, new_value)
uuid question old_value new_value
eb3b9935-fa0b-4d54-8058-3b629f1421ad primary_livelihood employment ngo ngo
eec1f630-15d5-475e-a344-32bba74b32ea primary_livelihood employment NA
f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb primary_livelihood employment NA

review_others

In the cleaning log, some opentext values are changed to blank. Some open text questions are linked some skip logic, i.e. what is X? Other, please specify. In some cases, values some values should be changed.

In the example below, the value for water_supply_other_neighbourhoods_why for the uuid 019bc718-c06a-46b8-bba8-c84f6c6efbd5 was changed to NA.

my_filled_log %>% 
  filter(question == "water_supply_other_neighbourhoods_why", 
         change_type == "blank_response")
uuid old_value question issue check_id check_binding change_type new_value enumerator_num
019bc718-c06a-46b8-bba8-c84f6c6efbd5 لا اعلم water_supply_other_neighbourhoods_why recode other NA water_supply_other_neighbourhoods_why / 019bc718-c06a-46b8-bba8-c84f6c6efbd5 blank_response NA 12

The kobo show a skip logic based on water_supply_other_neighbourhoods.

my_kobo_survey %>% 
  filter(name == "water_supply_other_neighbourhoods_why") %>% 
  select(type, name, relevant)
type name relevant
text water_supply_other_neighbourhoods_why selected(\({water_supply_other_neighbourhoods},'somewhat_worse') or selected(\){water_supply_other_neighbourhoods},‘much_worse’)
my_clean_data %>% 
  filter(X_uuid == "019bc718-c06a-46b8-bba8-c84f6c6efbd5") %>% 
  select(water_supply_other_neighbourhoods, water_supply_other_neighbourhoods_why   )
water_supply_other_neighbourhoods water_supply_other_neighbourhoods_why
somewhat_worse NA

Should the value of water_supply_other_neighbourhoods be changed? It depends on the question and skip logic but it important to flag those so a decision can be taken.

review_other_log <- review_others(dataset = my_clean_data2$data_with_fix_concat,
                                  uuid_column = "X_uuid", 
                                  kobo_survey = my_kobo_survey, 
                                  columns_not_to_check = "consent_telephone_number")
Warning in create_logic_for_other(kobo_survey = kobo_survey,
compare_with_dataset = TRUE, : The following parent names: well_quality,
spring_quality, rainwater_quality, surface_quality, why_not_connected were not
found in the dataset. The function is ignoring them.

review_cleaning

my_deletion_log <- my_clean_data2$cleaning_log %>% 
  filter(change_type == "remove_survey")

my_filled_log_no_deletion <- my_clean_data2$cleaning_log %>% 
  filter(change_type != "remove_survey") %>% 
  filter(!uuid %in% my_deletion_log$uuid)

review_of_cleaning <- review_cleaning(raw_dataset = my_raw_dataset,
                    raw_dataset_uuid_column = "X_uuid", 
                    clean_dataset = my_clean_data2$data_with_fix_concat,
                    clean_dataset_uuid_column = "X_uuid",
                    cleaning_log = my_filled_log_no_deletion, 
                    cleaning_log_uuid_column = "uuid",
                    cleaning_log_question_column = "question",
                    cleaning_log_new_value_column = "new_value",
                    cleaning_log_change_type_column = "change_type", 
                    cleaning_log_old_value_column = "old_value", 
                    deletion_log = my_deletion_log, 
                    deletion_log_uuid_column = "uuid"
                    )

Exercises

Exercise 1

  • Export the cleaning log you have created previously. The previous log is loaded below.
previous_exercise_log <- readRDS("inputs/03 - exercise - previous_log.RDS")

previous_exercise_log %>% names()
[1] "checked_dataset"        "percentage_missing_log" "potential_PII"         
[4] "logical_all"           

Did you try the function create_combined_log

Did you try the function create_xlsx_cleaning_log

previous_exercise_log %>% 
  create_combined_log() %>%
  create_xlsx_cleaning_log(output_path = "outputs/03 - correction - cleaning_log.xlsx", 
                           kobo_survey = my_kobo_survey,
                           kobo_choices = my_kobo_choice,
                           sm_dropdown_type = "logical",
                           use_dropdown = TRUE)

Exercise 2

  • Create the clean data from the raw dataset and the filled cleaning.
exercise_filled_log <- readxl::read_excel("inputs/04 - exercise - cleaning_log - filled.xlsx", sheet = "cleaning_log")

Did you try the function create_clean_data

Did you try the function recreate_parent_column

exercise_clean_dataset <- create_clean_data(raw_dataset = my_raw_dataset,
                                            raw_data_uuid_column = "X_uuid",
                                            cleaning_log = exercise_filled_log, 
                                            cleaning_log_uuid_column = "uuid",
                                            cleaning_log_question_column = "question",
                                            cleaning_log_new_value_column = "new_value",
                                            cleaning_log_change_type_column = "change_type")


exercise_clean_dataset2 <- recreate_parent_column(exercise_clean_dataset,
                                                  uuid_column = "X_uuid", 
                                                  kobo_survey = my_kobo_survey,
                                                  kobo_choices = my_kobo_choice,
                                                  cleaning_log_to_append = exercise_filled_log)

Exercise 3

  • Review the cleaning below, if there is someone else doing the exercise, you can try to review someone’s cleaning.
exercise3_clean_dataset <- readxl::read_excel("inputs/05 - exercise - clean dataset for review.xlsx")

exercise3_cleaning_log <- readxl::read_excel("inputs/05 - exercise - clean dataset for review.xlsx", sheet = 2)

Did you try the function review_cleaning

Did you separate the cleaning log?

exercise3_deletion_log <- exercise3_cleaning_log %>% 
  filter(change_type == "remove_survey")

exercise3_log_no_deletion <- exercise3_cleaning_log %>% 
  filter(change_type != "remove_survey") %>% 
  filter(!uuid %in% exercise3_deletion_log$uuid)

review_of_cleaning <- review_cleaning(raw_dataset = my_raw_dataset,
                                      raw_dataset_uuid_column = "X_uuid", 
                                      clean_dataset = exercise3_clean_dataset,
                                      clean_dataset_uuid_column = "X_uuid",
                                      cleaning_log = exercise3_log_no_deletion, 
                                      cleaning_log_uuid_column = "uuid",
                                      cleaning_log_question_column = "question",
                                      cleaning_log_new_value_column = "new_value",
                                      cleaning_log_change_type_column = "change_type", 
                                      cleaning_log_old_value_column = "old_value", 
                                      deletion_log = exercise3_deletion_log, 
                                      deletion_log_uuid_column = "uuid"
)

Review templates

There are some project templates that can be used to review the cleaning. More information on this repository.

Composition - adding indicators

The framework is built around 4 steps: cleaning, composition, analysis, outputs

  • Cleaning: any manipulation to go from the raw data to the clean data
  • Composition: any manipulation before the analysis e.g. adding indicators, adding information from loop or main, aok aggregation, etc.
  • Analysis: any manipulation regarding only the analysis
  • Outputs: any manipulation to format the outputs.

The following section will present some introduction about the composition.

library(addindicators)
library(dplyr)

my_data <- addindicators::addindicators_MSNA_template_data
add_*

add_* functions will add a variable (column) to the dataset. For example, to add the duration of a survey, to add the food consumption score category, etc.

add_* function takes a dataset as input and returns the dataset + the new indicator (and any intermediate steps used for the calculation).

For example, to check the duration of a survey, there is only the start and end, but not the duration column.

With addindicators some intermediate columns can be added if they are used to create the new indicator.

add_fcs

my_data_with_fcs <- my_data %>% add_fcs(
  cutoffs = "normal",
  fsl_fcs_cereal = "fs_fcs_cereals_grains_roots_tubers",
  fsl_fcs_legumes = "fs_fcs_beans_nuts",
  fsl_fcs_veg = "fs_fcs_vegetables_leaves",
  fsl_fcs_fruit = "fs_fcs_fruit",
  fsl_fcs_meat = "fs_fcs_meat_fish_eggs",
  fsl_fcs_dairy = "fs_fcs_dairy",
  fsl_fcs_sugar = "fs_fcs_sugar",
  fsl_fcs_oil = "fs_fcs_oil_fat_butter"
)

my_data_with_fcs[, tail(names(my_data_with_fcs), 10)] %>%
  head()
fcs_weight_cereal1 fcs_weight_legume2 fcs_weight_dairy3 fcs_weight_meat4 fcs_weight_veg5 fcs_weight_fruit6 fcs_weight_oil7 fcs_weight_sugar8 fsl_fcs_score fsl_fcs_cat
0 0 8 28 5 6 1.0 0.5 48.5 Acceptable
4 6 12 16 7 2 2.5 1.0 50.5 Acceptable
8 0 0 24 7 6 1.0 0.5 46.5 Acceptable
14 3 4 12 4 2 1.0 2.5 42.5 Acceptable
6 6 4 20 6 7 3.5 3.0 55.5 Acceptable
0 6 16 4 7 1 0.0 3.5 37.5 Acceptable
Note

You can learn more about food security indicators here.

add_hhs

Pipe-able

The framework is built around 2 adjectives, pipe-able and independent. In the framework, functions of the same family should be pipe-able. In the following case, 2 add_* functions are piped.

my_data_with_indicators <- my_data %>%
  add_fcs(
  cutoffs = "normal",
  fsl_fcs_cereal = "fs_fcs_cereals_grains_roots_tubers",
  fsl_fcs_legumes = "fs_fcs_beans_nuts",
  fsl_fcs_veg = "fs_fcs_vegetables_leaves",
  fsl_fcs_fruit = "fs_fcs_fruit",
  fsl_fcs_meat = "fs_fcs_meat_fish_eggs",
  fsl_fcs_dairy = "fs_fcs_dairy",
  fsl_fcs_sugar = "fs_fcs_sugar",
  fsl_fcs_oil = "fs_fcs_oil_fat_butter"
  ) %>%
  add_hhs(
  )

my_data_with_indicators[, tail(names(my_data_with_indicators), 14)] %>%
  head()
fsl_fcs_score fsl_fcs_cat fs_hhs_nofood_yn_recoded fs_hhs_nofood_freq_recoded fs_hhs_sleephungry_yn_recoded fs_hhs_sleephungry_freq_recoded fs_hhs_daynoteating_yn_recoded fs_hhs_daynoteating_freq_recoded hhs_comp1 hhs_comp2 hhs_comp3 hhs_score hhs_cat_ipc hhs_cat
48.5 Acceptable 0 0 0 0 0 0 0 0 0 0 None No or Little
50.5 Acceptable 1 1 0 0 0 1 1 0 0 1 Little No or Little
46.5 Acceptable 0 0 1 2 1 0 0 2 0 2 Moderate Moderate
42.5 Acceptable 0 0 0 1 1 0 0 0 0 0 None No or Little
55.5 Acceptable 0 0 0 1 1 0 0 0 0 0 None No or Little
37.5 Acceptable 1 1 0 2 1 2 1 0 2 3 Moderate Moderate

Composition - reviewing indicators

Reviewing indicators will compare 2 indicators together and present the differences. It will not check how the indicator was created nor check for inconsistencies. That mean, to review an indicator, it is necessary to create one and compare them. The functions review_one_variable and review_variables will focus on the latter.

review_variables

First, a new dataset can be created for the review.

review_df <- addindicators_MSNA_template_data %>%
  add_fcs(
  cutoffs = "normal",
  fsl_fcs_cereal = "fs_fcs_cereals_grains_roots_tubers",
  fsl_fcs_legumes = "fs_fcs_beans_nuts",
  fsl_fcs_veg = "fs_fcs_vegetables_leaves",
  fsl_fcs_fruit = "fs_fcs_fruit",
  fsl_fcs_meat = "fs_fcs_meat_fish_eggs",
  fsl_fcs_dairy = "fs_fcs_dairy",
  fsl_fcs_sugar = "fs_fcs_sugar",
  fsl_fcs_oil = "fs_fcs_oil_fat_butter"
  ) %>%
  select(uuid, fsl_fcs_score, fsl_fcs_cat)

Then the dataset to be reviewed and the new dataset can be binded together.

binded_df <- my_data_with_indicators %>%
  full_join(review_df, by = "uuid")
Note

I would advice to use a full_join rather than a left/right_join. That way if any computation has missing value they will be spotted.

Note

With the join_* if the names are the same .x and .y will added to the names.

review_*

review_* functions will review an object by comparing it to standards or another object and flags differences, e.g. reviewing the cleaning by comparing the raw dataset, the clean dataset and the cleaning log, analysis comparing it with another analysis.

review_one_var <- review_variables(binded_df,
  columns_to_review = "fsl_fcs_cat.x",
  columns_to_compare_with = "fsl_fcs_cat.y")


review_one_var %>% 
  names()
[1] "dataset"      "review_table"

It is a list with the dataset and a review table.

review_one_var$review_table %>% 
  head()
uuid variable review_check review_comment
eaf540cd-32bd-41474b-b4beb5-d62fc987e45a fsl_fcs_cat.x TRUE Same results
89e706c3-53d8-4a4049-898586-4926085db71e fsl_fcs_cat.x TRUE Same results
afd921c6-e54a-4c4740-919c93-87f59bd0e63a fsl_fcs_cat.x TRUE Same results
d8b05f39-ba85-494c4d-808c84-9dc57823a4f1 fsl_fcs_cat.x TRUE Same results
d6b42f9e-c209-4c4541-808a81-86bea53df142 fsl_fcs_cat.x TRUE Same results
f1b9ec67-20db-47404d-a3ada0-1a37e5c49d02 fsl_fcs_cat.x TRUE Same results

The review table can be summarised to have a quicker overview.

review_one_var$review_table %>%
  group_by(review_check, review_comment) %>%
  tally()
review_check review_comment n
TRUE Same results 100

To see how differences are shown, some noise is introduced to the dataset.

jittered_df <- binded_df
set.seed(123)
jittered_df[sample(1:nrow(jittered_df), 5), "fsl_fcs_cat.x"] <- sample(unique(jittered_df$fsl_fcs_cat.y), 5, T)
set.seed(124)
jittered_df[sample(1:nrow(jittered_df), 5), "fsl_fcs_cat.y"] <- sample(unique(jittered_df$fsl_fcs_cat.y), 5, T)
set.seed(125)
jittered_df[sample(1:nrow(jittered_df), 5), "fsl_fcs_cat.x"] <- NA
set.seed(1236)
jittered_df[sample(1:nrow(jittered_df), 5), "fsl_fcs_cat.y"] <- NA
set.seed(1237)
jittered_df[sample(1:nrow(jittered_df), 5), "fsl_fcs_score.x"] <- sample(unique(jittered_df$fsl_fcs_score.x), 5, T)
review_one_variable_jittered <- review_variables(jittered_df,
  columns_to_review = "fsl_fcs_cat.x",
  columns_to_compare_with = "fsl_fcs_cat.y")

review_one_variable_jittered$review_table %>%
  group_by(review_check, review_comment) %>%
  tally()
review_check review_comment n
FALSE Different results 9
FALSE Missing in fsl_fcs_cat.x 5
FALSE Missing in fsl_fcs_cat.y 5
TRUE Same results 81

The dataset has new columns to help filtering for further investigation.

review_one_variable_jittered$dataset[, tail(names(review_one_variable_jittered$dataset), 5)] %>%
  head()
hhs_cat fsl_fcs_score.y fsl_fcs_cat.y review_check_fsl_fcs_cat.x review_comment_fsl_fcs_cat.x
No or Little 48.5 Acceptable TRUE Same results
No or Little 50.5 Acceptable TRUE Same results
Moderate 46.5 Acceptable TRUE Same results
No or Little 42.5 Acceptable TRUE Same results
No or Little 55.5 Acceptable TRUE Same results
Moderate 37.5 Poor FALSE Different results
review_one_variable_jittered$dataset %>%
  filter(!review_check_fsl_fcs_cat.x) %>%
  select(uuid, fsl_fcs_cat.x, fsl_fcs_cat.y, review_check_fsl_fcs_cat.x, review_comment_fsl_fcs_cat.x)
uuid fsl_fcs_cat.x fsl_fcs_cat.y review_check_fsl_fcs_cat.x review_comment_fsl_fcs_cat.x
f1b9ec67-20db-47404d-a3ada0-1a37e5c49d02 Acceptable Poor FALSE Different results
42dc8573-e2d0-43484b-aaada2-c37ef865d041 Borderline Acceptable FALSE Different results
fcd69a08-498f-4c4b47-989799-743cbe5fd960 NA Acceptable FALSE Missing in fsl_fcs_cat.x
d36b8cfa-bf52-48434f-8b8d88-ef620d73c941 Poor Acceptable FALSE Different results
72095b68-8c51-484245-929b97-360eda2c4b81 NA Acceptable FALSE Missing in fsl_fcs_cat.x
675eb3d0-62ba-4e4b42-a3a4af-08f14dce6592 Poor Acceptable FALSE Different results
952efab8-4a32-4a4743-bcb4b8-d9a534fb8627 Poor Acceptable FALSE Different results
01648f7a-8521-4d4347-afa3a7-7fd83ea05916 Acceptable Borderline FALSE Different results
10db47e8-a721-49454a-888385-59068d4a3ef2 Poor Borderline FALSE Different results
46c81eb3-7243-414f40-919e98-270439d8fbc5 NA Acceptable FALSE Missing in fsl_fcs_cat.x
765d2341-5df2-43484a-8e888b-362fa085de17 Acceptable NA FALSE Missing in fsl_fcs_cat.y
0dea8527-2ab9-4e4844-88868e-8379e514ca2b NA Acceptable FALSE Missing in fsl_fcs_cat.x
da823c6f-d215-43474c-b4b0b6-6ba519748c0f NA Acceptable FALSE Missing in fsl_fcs_cat.x
e76f4382-b3ea-4b404e-bebfb0-a6b2d7594c18 Acceptable Poor FALSE Different results
42698e10-7e19-4d4744-8d8b8c-04d187cb62ae Borderline NA FALSE Missing in fsl_fcs_cat.y
bdc23a6e-1a35-49474b-949e9b-57bf01284ea9 Acceptable NA FALSE Missing in fsl_fcs_cat.y
2bd1809c-a2c1-424b44-b7b0b4-9a53b24d0e67 Borderline Acceptable FALSE Different results
de416c95-845f-4d4f40-868f8c-e12b3946ad07 Acceptable NA FALSE Missing in fsl_fcs_cat.y
31f4e76d-c64e-4b4144-bbb1bf-b05ca69d823e Acceptable NA FALSE Missing in fsl_fcs_cat.y

If there are more than one variable to review, pair-wise vectors can be used.

my_review <- review_variables(jittered_df,
  columns_to_review = c("fsl_fcs_cat.x", "fsl_fcs_score.x"),
  columns_to_compare_with = c("fsl_fcs_cat.y", "fsl_fcs_score.y")
)
my_review$review_table %>%
  group_by(variable, review_check, review_comment) %>%
  tally()
variable review_check review_comment n
fsl_fcs_cat.x FALSE Different results 9
fsl_fcs_cat.x FALSE Missing in fsl_fcs_cat.x 5
fsl_fcs_cat.x FALSE Missing in fsl_fcs_cat.y 5
fsl_fcs_cat.x TRUE Same results 81
fsl_fcs_score.x FALSE Different results 5
fsl_fcs_score.x TRUE Same results 95
my_review$dataset %>%
  filter(!review_check_fsl_fcs_cat.x) %>%
  select(uuid, fsl_fcs_cat.x, fsl_fcs_cat.y, review_comment_fsl_fcs_cat.x)
uuid fsl_fcs_cat.x fsl_fcs_cat.y review_comment_fsl_fcs_cat.x
f1b9ec67-20db-47404d-a3ada0-1a37e5c49d02 Acceptable Poor Different results
42dc8573-e2d0-43484b-aaada2-c37ef865d041 Borderline Acceptable Different results
fcd69a08-498f-4c4b47-989799-743cbe5fd960 NA Acceptable Missing in fsl_fcs_cat.x
d36b8cfa-bf52-48434f-8b8d88-ef620d73c941 Poor Acceptable Different results
72095b68-8c51-484245-929b97-360eda2c4b81 NA Acceptable Missing in fsl_fcs_cat.x
675eb3d0-62ba-4e4b42-a3a4af-08f14dce6592 Poor Acceptable Different results
952efab8-4a32-4a4743-bcb4b8-d9a534fb8627 Poor Acceptable Different results
01648f7a-8521-4d4347-afa3a7-7fd83ea05916 Acceptable Borderline Different results
10db47e8-a721-49454a-888385-59068d4a3ef2 Poor Borderline Different results
46c81eb3-7243-414f40-919e98-270439d8fbc5 NA Acceptable Missing in fsl_fcs_cat.x
765d2341-5df2-43484a-8e888b-362fa085de17 Acceptable NA Missing in fsl_fcs_cat.y
0dea8527-2ab9-4e4844-88868e-8379e514ca2b NA Acceptable Missing in fsl_fcs_cat.x
da823c6f-d215-43474c-b4b0b6-6ba519748c0f NA Acceptable Missing in fsl_fcs_cat.x
e76f4382-b3ea-4b404e-bebfb0-a6b2d7594c18 Acceptable Poor Different results
42698e10-7e19-4d4744-8d8b8c-04d187cb62ae Borderline NA Missing in fsl_fcs_cat.y
bdc23a6e-1a35-49474b-949e9b-57bf01284ea9 Acceptable NA Missing in fsl_fcs_cat.y
2bd1809c-a2c1-424b44-b7b0b4-9a53b24d0e67 Borderline Acceptable Different results
de416c95-845f-4d4f40-868f8c-e12b3946ad07 Acceptable NA Missing in fsl_fcs_cat.y
31f4e76d-c64e-4b4144-bbb1bf-b05ca69d823e Acceptable NA Missing in fsl_fcs_cat.y
my_review$dataset %>%
  filter(!review_check_fsl_fcs_score.x) %>%
  select(uuid, fsl_fcs_score.x, fsl_fcs_score.y, review_comment_fsl_fcs_score.x)
uuid fsl_fcs_score.x fsl_fcs_score.y review_comment_fsl_fcs_score.x
afd921c6-e54a-4c4740-919c93-87f59bd0e63a 87.0 46.5 Different results
c14529b6-06f7-4d4446-a1a7ac-b3648529ef7d 61.0 78.5 Different results
4a78b1d6-ad91-4b4f45-aba5a0-bed8cf257106 41.5 30.0 Different results
26bce981-8217-464d45-979a96-8a69e5371b02 76.5 47.5 Different results
ad31fe62-5a3c-484d49-b7b8bc-a79d8304be65 37.5 42.5 Different results

Exercises

Exercise 1

  • Add the food consumption matrix score to the dataset. The food consumption matrix score is a food security indicator that uses the food consumption score, household hunger score and the reduced coping strategy index.
name label::english type
rCSILessQlty During the last 7 days, were there days (and, if so, how many) when your household had to rely on less preferred and less expensive food to cope with a lack of food or money to buy it? integer
rCSIBorrow During the last 7 days, were there days (and, if so, how many) when your household had to borrow food or rely on help from a relative or friend to cope with a lack of food or money to buy it? integer
rCSIMealSize During the last 7 days, were there days (and, if so, how many) when your household had to limit portion size of meals at meal times to cope with a lack of food or money to buy it? integer
rCSIMealAdult During the last 7 days, were there days (and, if so, how many) when your household had to restrict consumption by adults in order for small children to eat to cope with a lack of food or money to buy it? integer
rCSIMealNb During the last 7 days, were there days (and, if so, how many) when your household had to reduce number of meals eaten in a day to cope with a lack of food or money to buy it? integer
library(addindicators)
library(dplyr)
exercise_data <- addindicators_MSNA_template_data %>%
  add_fcs(
  cutoffs = "normal",
  fsl_fcs_cereal = "fs_fcs_cereals_grains_roots_tubers",
  fsl_fcs_legumes = "fs_fcs_beans_nuts",
  fsl_fcs_veg = "fs_fcs_vegetables_leaves",
  fsl_fcs_fruit = "fs_fcs_fruit",
  fsl_fcs_meat = "fs_fcs_meat_fish_eggs",
  fsl_fcs_dairy = "fs_fcs_dairy",
  fsl_fcs_sugar = "fs_fcs_sugar",
  fsl_fcs_oil = "fs_fcs_oil_fat_butter"
  ) %>%
  add_hhs(
  )

Did you try the function add_fcm_phase?

The food consumption matrix needs 3 indicators, FCS, rCSI, HHS.

Have you used the correct HHS category variable?

my_answer <- exercise_data %>% add_rcsi(
  ) %>%
  add_fcm_phase(
    fcs_column_name = "fsl_fcs_cat",
    rcsi_column_name = "rcsi_cat",
    hhs_column_name = "hhs_cat_ipc",
    fcs_categories_acceptable = "Acceptable",
    fcs_categories_poor = "Poor",
    fcs_categories_borderline = "Borderline",
    rcsi_categories_low = "No to Low",
    rcsi_categories_medium = "Medium",
    rcsi_categories_high = "High",
    hhs_categories_none = "None",
    hhs_categories_little = "Little",
    hhs_categories_moderate = "Moderate",
    hhs_categories_severe = "Severe",
    hhs_categories_very_severe = "Very Severe"
  )

Exercise 2

  • You receive a dataset, you need to review the following four indicators.

    • Food Consumption Score: fcs_score, fcs_cat
    • Household Hunger Score: hhs_score, hhs_cat

Don’t forget to write the review.

dataset_to_review <- read.csv("inputs/06 - exercise - dataset_to_review.csv")

dataset_without_indicators <- addindicators::addindicators_MSNA_template_data

Did you try the function review_variables

How was the FSC created?

How was coded the category for the HHS?

my_review <- dataset_without_indicators %>% 
    add_fcs(
  cutoffs = "normal",
  fsl_fcs_cereal = "fs_fcs_cereals_grains_roots_tubers",
  fsl_fcs_legumes = "fs_fcs_beans_nuts",
  fsl_fcs_veg = "fs_fcs_vegetables_leaves",
  fsl_fcs_fruit = "fs_fcs_fruit",
  fsl_fcs_meat = "fs_fcs_meat_fish_eggs",
  fsl_fcs_dairy = "fs_fcs_dairy",
  fsl_fcs_sugar = "fs_fcs_sugar",
  fsl_fcs_oil = "fs_fcs_oil_fat_butter"
  )  %>% add_hhs(
    hhs_nofoodhh_1 = "fs_hhs_nofood_yn",
    hhs_nofoodhh_1a = "fs_hhs_nofood_freq",
    hhs_sleephungry_2 = "fs_hhs_sleephungry_yn",
    hhs_sleephungry_2a = "fs_hhs_sleephungry_freq",
    hhs_alldaynight_3 = "fs_hhs_daynoteating_yn",
    hhs_alldaynight_3a = "fs_hhs_daynoteating_freq",
    yes_answer = "yes",
    no_answer = "no",
    rarely_answer = "rarely_1_2",
    sometimes_answer = "sometimes_3_10",
    often_answer = "often_10_times"
  ) %>% 
  select(uuid, fsl_fcs_cat, fsl_fcs_score, hhs_cat, hhs_score)
dataset_to_review <- full_join(dataset_to_review, my_review, by = "uuid")

review <- dataset_to_review %>% 
  review_variables(columns_to_review = c("fsl_fcs_cat.x", "fsl_fcs_score.x", "hhs_cat.x", "hhs_score.x"),
                   columns_to_compare_with = c("fsl_fcs_cat.y", "fsl_fcs_score.y", "hhs_cat.y", "hhs_score.y"))

review$review_table %>% 
  group_by(variable,review_check,review_comment) %>% 
  tally()
  • There are 10 fcs categories that are different.
  • There are 100 HHS categories that are different
review$dataset %>% 
  filter(!review_check_fsl_fcs_cat.x) %>% 
  select(uuid, review_comment_fsl_fcs_cat.x, fsl_fcs_score.x, fsl_fcs_cat.x, fsl_fcs_cat.y)
uuid review_comment_fsl_fcs_cat.x fsl_fcs_score.x fsl_fcs_cat.x fsl_fcs_cat.y
f1b9ec67-20db-47404d-a3ada0-1a37e5c49d02 Different results 37.5 Borderline Acceptable
e21a34f5-1a46-42404b-b7b6be-7bc9286d0f13 Different results 36.0 Borderline Acceptable
42dc8573-e2d0-43484b-aaada2-c37ef865d041 Different results 39.0 Borderline Acceptable
fcd69a08-498f-4c4b47-989799-743cbe5fd960 Different results 36.5 Borderline Acceptable
4d1cae02-49e0-484c4e-8f8d8c-a97dec246310 Different results 41.0 Borderline Acceptable
a8319ceb-857c-434142-b9b8bc-7905684fc1d3 Different results 37.5 Borderline Acceptable
6d1acb45-cfb6-4b4441-87888f-e2a4756d30f9 Different results 38.0 Borderline Acceptable
0dea8527-2ab9-4e4844-88868e-8379e514ca2b Different results 40.0 Borderline Acceptable
7e94afc5-af0b-4a4c46-bebcb7-9f60d1e53a47 Different results 37.0 Borderline Acceptable
b719ef08-bdf5-474240-858d8a-a12bc65d349e Different results 41.5 Borderline Acceptable
  • Food Consumption Score have different categories, what threshold were used to compute the FCS? Maybe 28-42?
review$dataset %>% 
  filter(!review_check_hhs_cat.x) %>% 
  select(hhs_cat.x, hhs_cat.y) %>% 
  table(useNA = "ifany")
              hhs_cat.y
hhs_cat.x      Moderate No or Little Severe
  moderate           29            0      0
  no_or_little        0           58      0
  severe              0            0     13
  • HHS is fine. Labeling is different

impactR4PHU

The functions about public health checks and indicators are more updated on the impactR4PHU. More information on this repository.